Open Tabs
- 08-visualization-plotly.ipynb
- 09-visualization-seaborn.ipynb
- 10-databases-sql.ipynb
Kernels
- 17-ts-core.ipynb
- 031-data-wrangling-with-mongodb.ipynb
- 04-pandas-advanced.ipynb
- 18-ts-models.ipynb
- 11-databases-mongodb.ipynb
- 10-databases-sql.ipynb
- 033-autoregressive-models.ipynb
- 035-assignment.ipynb
- 032-linear-regression-with-time-series-data.ipynb
- 034-arma-models-and-hyperparameter-tuning.ipynb
- 01-python-getting-started.2022-12-23T07-21-48-609Z.ipynb
- 02-python-advanced.ipynb
- 03-pandas-getting-started.2022-12-23T07-21-48-609Z.ipynb
- 04-pandas-advanced.2022-12-23T07-21-48-609Z.ipynb
- 05-pandas-summary-statistics.2022-12-23T07-21-48-609Z.ipynb
- 06-visualization-matplotlib.2022-12-23T07-21-48-609Z.ipynb
- 07-visualization-pandas.ipynb
- 08-visualization-plotly.ipynb
- 09-visualization-seaborn.ipynb
Terminals
- .ipynb_checkpointsan hour ago
- data2 months ago
- 01-python-getting-started.2022-12-23T07-21-48-609Z.ipynban hour ago
- 01-python-getting-started.ipynb2 months ago
- 02-python-advanced.ipynb2 hours ago
- 03-pandas-getting-started.2022-12-23T07-21-48-609Z.ipynb2 months ago
- 03-pandas-getting-started.ipynb2 months ago
- 04-pandas-advanced.2022-12-23T07-21-48-609Z.ipynban hour ago
- 04-pandas-advanced.ipynb2 months ago
- 05-pandas-summary-statistics.2022-12-23T07-21-48-609Z.ipynban hour ago
- 05-pandas-summary-statistics.ipynb2 months ago
- 06-visualization-matplotlib.2022-12-23T07-21-48-609Z.ipynb33 minutes ago
- 06-visualization-matplotlib.ipynb2 months ago
- 07-visualization-pandas.ipynb8 minutes ago
- 08-visualization-plotly.ipynba minute ago
- 09-visualization-seaborn.ipynba month ago
- 10-databases-sql.ipynb2 months ago
- 11-databases-mongodb.ipynb19 days ago
- 12-ml-core.ipynb2 months ago
- 13-ml-data-pre-processing-and-production.ipynba month ago
- 14-ml-classification.ipynb2 months ago
- 15-ml-regression.ipynba month ago
- 16-ml-unsupervised-learning.ipynb2 months ago
- 17-ts-core.ipynb2 months ago
- 18-ts-models.ipynb2 months ago
- 19-linux-command-line.ipynb2 months ago
- 20-statistics.ipynb2 months ago
- 21-python-object-oriented-programming.ipynb2 months ago
- 22-apis.ipynb2 months ago
- main.py3 months ago
- 08-visualization-plotly.ipynb
- 09-visualization-seaborn.ipynb
- 10-databases-sql.ipynb
xxxxxxxxxxUsage Guidelines
This lesson is part of the DS Lab core curriculum. For that reason, this notebook can only be used on your WQU virtual machine.
This means:
- ⓧ No downloading this notebook.
- ⓧ No re-sharing of this notebook with friends or colleagues.
- ⓧ No downloading the embedded videos in this notebook.
- ⓧ No re-sharing embedded videos with friends or colleagues.
- ⓧ No adding this notebook to public or private repositories.
- ⓧ No uploading this notebook (or screenshots of it) to other websites, including websites for study resources.
xxxxxxxxxx<font size="+3"><strong>Visualizing Data: plotly express</strong></font>Visualizing Data: plotly express
xxxxxxxxxxThere are many ways to interact with data, and one of the most powerful modes of interaction is through **visualizations**. Visualizations show data graphically, and are useful for exploring, analyzing, and presenting datasets. We use four libraries for making visualizations: [pandas](../%40textbook/07-visualization-pandas.ipynb), [Matplotlib](../%40textbook/06-visualization-matplotlib.ipynb), plotly express, and [seaborn](../%40textbook/09-visualization-seaborn.ipynb). In this section, we'll focus on using plotly express.There are many ways to interact with data, and one of the most powerful modes of interaction is through visualizations. Visualizations show data graphically, and are useful for exploring, analyzing, and presenting datasets. We use four libraries for making visualizations: pandas, Matplotlib, plotly express, and seaborn. In this section, we'll focus on using plotly express.
xxxxxxxxxx# Scatter PlotsScatter Plots¶
xxxxxxxxxxA **scatter plot** is a graph that uses dots to represent values for two different numeric variables. The position of each dot on the horizontal and vertical axis indicates values for an individual data point. Scatter plots are used to observe relationships between variables, and are especially useful if you're looking for **correlations**.A scatter plot is a graph that uses dots to represent values for two different numeric variables. The position of each dot on the horizontal and vertical axis indicates values for an individual data point. Scatter plots are used to observe relationships between variables, and are especially useful if you're looking for correlations.
import pandas as pdmexico_city1 = pd.read_csv("./data/mexico-city-real-estate-1.csv")# clean the data and drop `NaNs`mexico_city1 = mexico_city1.drop( ["floor", "price_usd_per_m2", "expenses", "rooms"], axis=1)mexico_city1 = mexico_city1.dropna(axis=0)mexico_city1.head()| operation | property_type | place_with_parent_names | lat-lon | price | currency | price_aprox_local_currency | price_aprox_usd | surface_total_in_m2 | surface_covered_in_m2 | price_per_m2 | properati_url | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | sell | apartment | |México|Distrito Federal|Cuauhtémoc| | 19.41501,-99.175174 | 2700000.0 | MXN | 2748947.10 | 146154.51 | 61.0 | 61.0 | 44262.295082 | http://cuauhtemoc.properati.com.mx/2pu_venta_a... |
| 3 | sell | apartment | |México|Distrito Federal|Cuauhtémoc| | 19.41501,-99.175174 | 6347000.0 | MXN | 6462061.92 | 343571.36 | 176.0 | 128.0 | 49585.937500 | http://cuauhtemoc.properati.com.mx/2pv_venta_a... |
| 6 | sell | apartment | |México|Distrito Federal|Miguel Hidalgo| | 19.456564,-99.191724 | 670000.0 | MXN | 682146.11 | 36267.97 | 65.0 | 65.0 | 10307.692308 | http://miguel-hidalgo-df.properati.com.mx/46h_... |
| 7 | sell | apartment | |México|Distrito Federal|Gustavo A. Madero| | 19.512787,-99.141393 | 1400000.0 | MXN | 1425379.97 | 75783.82 | 82.0 | 70.0 | 20000.000000 | http://gustavo-a-madero.properati.com.mx/46p_v... |
| 8 | sell | house | |México|Distrito Federal|Álvaro Obregón| | 19.358776,-99.213557 | 6680000.0 | MXN | 6801098.67 | 361597.08 | 346.0 | 346.0 | 19306.358382 | http://alvaro-obregon.properati.com.mx/46t_ven... |
xxxxxxxxxxAfter cleaning the data, we can use plotly express to draw scatter plots by specifying the DataFrame and the interested column names.After cleaning the data, we can use plotly express to draw scatter plots by specifying the DataFrame and the interested column names.
import plotly.express as pxfig = px.scatter(mexico_city1, x="price", y="surface_covered_in_m2")fig.show()xxxxxxxxxx<font size="+1">Practice</font> Practice
Plot the scatter plot for column "price" and "surface_total_in_m2".
xxxxxxxxxx# 3D Scatter Plots3D Scatter Plots¶
Scatter plots can summarize information in a DataFrame. Three dimensional scatter plots look great, but be careful: it can be difficult for people who might not be sure what they're looking at to accurately determine values of points in the plot. Still, scatter plots are useful for displaying relationships between three quantities that would be more difficult to observe in a two dimensional plot.
Let's take a look at the first 50 rows of the mexico-city-real-estate-1.csv dataset.
import pandas as pdimport plotly.express as pxmexico_city1 = pd.read_csv("./data/mexico-city-real-estate-1.csv")mexico_city1 = mexico_city1.drop( ["floor", "price_usd_per_m2", "expenses", "rooms"], axis=1)mexico_city1 = mexico_city1.dropna(axis=0)mexico_city1[ ["First Empty", "Country", "City", "Borough", "Second Empty"]] = mexico_city1["place_with_parent_names"].str.split("|", 4, expand=True)mexico_city1 = mexico_city1.drop(["First Empty", "Second Empty"], axis=1)mexico_city1_subset = mexico_city1.loc[1:50]fig = px.scatter_3d( mexico_city1_subset, x="Borough", y="surface_covered_in_m2", z="price", symbol="property_type", color="property_type", labels={ "surface_covered_in_m2": "Surface Covered in m^2", "price": "Price", "property_type": "Property Type", },)fig.show()xxxxxxxxxxNotice that the plot is interactive: you can rotate it zoom in or out. These kinds of plots also makes outliers easier to find; here, we can see that houses have higher prices than other types of properties.Notice that the plot is interactive: you can rotate it zoom in or out. These kinds of plots also makes outliers easier to find; here, we can see that houses have higher prices than other types of properties.
xxxxxxxxxx<font size="+1">Practice</font> Practice
Modify the DataFrame to include columns for the base 10 log of price and surface_covered_in_m2 and then plot these for the entire mexico-city-real-estate-1.csv dataset.
import mathxxxxxxxxxx# Mapbox Scatter PlotsMapbox Scatter Plots¶
xxxxxxxxxxA **mapbox scatter plot** is a special kind of scatter plot that allows you to create scatter plots in two dimensions and then superimpose them on top of a map. Our `mexico-city-real-estate-1.csv` dataset is a good place to start, because it includes **location data**. After importing the dataset and removing rows with missing data, split the `lat-lon` column into two separate columns: one for `latitude` and the other for `longitude`. Then use these to make a mapbox plot. Unfortunately, at present this type of plot does not easily allow for marker shape to vary based on a column of the DataFrame.A mapbox scatter plot is a special kind of scatter plot that allows you to create scatter plots in two dimensions and then superimpose them on top of a map. Our mexico-city-real-estate-1.csv dataset is a good place to start, because it includes location data. After importing the dataset and removing rows with missing data, split the lat-lon column into two separate columns: one for latitude and the other for longitude. Then use these to make a mapbox plot. Unfortunately, at present this type of plot does not easily allow for marker shape to vary based on a column of the DataFrame.
mexico_city1 = pd.read_csv("./data/mexico-city-real-estate-1.csv")mexico_city1 = mexico_city1.drop( ["floor", "price_usd_per_m2", "expenses", "rooms"], axis=1)mexico_city1 = mexico_city1.dropna(axis=0)mexico_city1[["latitude", "longitude"]] = mexico_city1["lat-lon"].str.split( ",", 2, expand=True)mexico_city1["latitude"] = mexico_city1["latitude"].astype(float)mexico_city1["longitude"] = mexico_city1["longitude"].astype(float)fig = px.scatter_mapbox( mexico_city1, lat="latitude", lon="longitude", color="property_type", mapbox_style="carto-positron", labels={"property_type": "Property Type"}, title="Distribution of Property Types for Sale in Mexico City",)fig.show()/tmp/ipykernel_659/3692783844.py:6: FutureWarning: In a future version of pandas all arguments of StringMethods.split except for the argument 'pat' will be keyword-only.
xxxxxxxxxx<font size="+1">Practice</font> Practice
Create another column in the DataFrame with a log scale of the prices. Then create three separate plots, one for stores, another for houses, and a final one for apartments. Color the points in the plots by the log of the price.
from math import log10xxxxxxxxxx# Choropleth MapsChoropleth Maps¶
xxxxxxxxxxA Choropleth Map is a map composed of colored polygons, showing the variable of interest at different color depth across geographies.Plotly express has a function called `px.choropleth` that be used to plot Choropleth Map. The challenges here are getting the geometry information. There are two ways, one is to use the built-in geometries in plotly when plot US States (use the state name directly) and world countries (use ISP-3 code). Another way is to look for GeoJSON files where each location has geometry information. In the following example, we will show the plot in US States with a synthetic data set. A Choropleth Map is a map composed of colored polygons, showing the variable of interest at different color depth across geographies.Plotly express has a function called px.choropleth that be used to plot Choropleth Map. The challenges here are getting the geometry information. There are two ways, one is to use the built-in geometries in plotly when plot US States (use the state name directly) and world countries (use ISP-3 code). Another way is to look for GeoJSON files where each location has geometry information. In the following example, we will show the plot in US States with a synthetic data set.
xxxxxxxxxx# Create Synthetic Datasetdf = pd.DataFrame.from_dict( {"State": ["CA", "TX", "NY", "HI", "DE"], "Temparature": [100, 120, 110, 90, 105]})df| State | Temparature | |
|---|---|---|
| 0 | CA | 100 |
| 1 | TX | 120 |
| 2 | NY | 110 |
| 3 | HI | 90 |
| 4 | DE | 105 |
xxxxxxxxxx# Plot the data set in US mapfig = px.choropleth( df, locations="State", locationmode="USA-states", color="Temparature", scope="usa")fig.show()xxxxxxxxxx# HistogramHistogram¶
xxxxxxxxxxA **histogram** is a graph that shows the frequency distribution of numerical data. In addition to helping us understand frequency, histograms are also useful for detecting outliers. We can use the `px.histogram()` function from Plotly to draw histograms for specific columns, as long as the data type is numerical. Let's check the following example:A histogram is a graph that shows the frequency distribution of numerical data. In addition to helping us understand frequency, histograms are also useful for detecting outliers. We can use the px.histogram() function from Plotly to draw histograms for specific columns, as long as the data type is numerical. Let's check the following example:
xxxxxxxxxximport plotly.express as pxdf = pd.read_csv("data/mexico-city-real-estate-1.csv")fig = px.histogram(df, x="price")fig.show()xxxxxxxxxx<font size="+1">Practice</font> Practice
Check the "surface_covered_in_m2" Histogram.
xxxxxxxxxx# BoxplotsBoxplots¶
xxxxxxxxxxA **boxplot** is a graph that shows the minimum, first quartile, median, third quartile, and the maximum values in a dataset. Boxplots are useful because they provide a visual summary of the data, enabling researchers to quickly identify mean values, the dispersion of the data set, and signs of skewness. In the following example, we will explore how to draw boxplots for specific columns of a DataFrame.A boxplot is a graph that shows the minimum, first quartile, median, third quartile, and the maximum values in a dataset. Boxplots are useful because they provide a visual summary of the data, enabling researchers to quickly identify mean values, the dispersion of the data set, and signs of skewness. In the following example, we will explore how to draw boxplots for specific columns of a DataFrame.
xxxxxxxxxx# Read Datamexico_city1 = pd.read_csv("./data/mexico-city-real-estate-1.csv")# Clean the data and drop `NaNs`mexico_city1 = mexico_city1.drop( ["floor", "price_usd_per_m2", "expenses", "rooms"], axis=1)mexico_city1 = mexico_city1.dropna(axis=0)# Exclude some outliersmexico_city1 = mexico_city1[mexico_city1["price"] < 100000000]mexico_city1.head()| operation | property_type | place_with_parent_names | lat-lon | price | currency | price_aprox_local_currency | price_aprox_usd | surface_total_in_m2 | surface_covered_in_m2 | price_per_m2 | properati_url | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | sell | apartment | |México|Distrito Federal|Cuauhtémoc| | 19.41501,-99.175174 | 2700000.0 | MXN | 2748947.10 | 146154.51 | 61.0 | 61.0 | 44262.295082 | http://cuauhtemoc.properati.com.mx/2pu_venta_a... |
| 3 | sell | apartment | |México|Distrito Federal|Cuauhtémoc| | 19.41501,-99.175174 | 6347000.0 | MXN | 6462061.92 | 343571.36 | 176.0 | 128.0 | 49585.937500 | http://cuauhtemoc.properati.com.mx/2pv_venta_a... |
| 6 | sell | apartment | |México|Distrito Federal|Miguel Hidalgo| | 19.456564,-99.191724 | 670000.0 | MXN | 682146.11 | 36267.97 | 65.0 | 65.0 | 10307.692308 | http://miguel-hidalgo-df.properati.com.mx/46h_... |
| 7 | sell | apartment | |México|Distrito Federal|Gustavo A. Madero| | 19.512787,-99.141393 | 1400000.0 | MXN | 1425379.97 | 75783.82 | 82.0 | 70.0 | 20000.000000 | http://gustavo-a-madero.properati.com.mx/46p_v... |
| 8 | sell | house | |México|Distrito Federal|Álvaro Obregón| | 19.358776,-99.213557 | 6680000.0 | MXN | 6801098.67 | 361597.08 | 346.0 | 346.0 | 19306.358382 | http://alvaro-obregon.properati.com.mx/46t_ven... |
xxxxxxxxxxCheck the boxplot for column `"price"`:Check the boxplot for column "price":
xxxxxxxxxxIf you want to check the distribution of a column value by different categories, defined by another categorical column, you can add an `x` argument to specify the name of the categorical column. In the following example, we check the price distribution across different property types:If you want to check the distribution of a column value by different categories, defined by another categorical column, you can add an x argument to specify the name of the categorical column. In the following example, we check the price distribution across different property types:
xxxxxxxxxx<font size="+1">Practice</font> Practice
Check the "surface_covered_in_m2" distribution by property types.
xxxxxxxxxxfig = ...fig.show()--------------------------------------------------------------------------- AttributeError Traceback (most recent call last) Cell In [15], line 2 1 fig = ... ----> 2 fig.show() AttributeError: 'ellipsis' object has no attribute 'show'
xxxxxxxxxx# Bar ChartBar Chart¶
xxxxxxxxxxA **bar chart** is a graph that shows all the values of a categorical variable in a dataset. They consist of an axis and a series of labeled horizontal or vertical bars. The bars depict frequencies of different values of a variable or simply the different values themselves. The numbers on the y-axis of a vertical bar chart or the x-axis of a horizontal bar chart are called the scale. A bar chart is a graph that shows all the values of a categorical variable in a dataset. They consist of an axis and a series of labeled horizontal or vertical bars. The bars depict frequencies of different values of a variable or simply the different values themselves. The numbers on the y-axis of a vertical bar chart or the x-axis of a horizontal bar chart are called the scale.
In the following example, we will see some bar plots based on the Mexico City real estate dataset. Specifically, we will count the number of observations in each borough and plot them. We first need to read the data set and extract Borough and other location information from column "place_with_parent_names".
xxxxxxxxxx# Read Datamexico_city1 = pd.read_csv("./data/mexico-city-real-estate-1.csv")# Clean the data and drop `NaNs`mexico_city1 = mexico_city1.drop( ["floor", "price_usd_per_m2", "expenses", "rooms"], axis=1)# find location columns from place_with_parent_namesmexico_city1[ ["First Empty", "Country", "City", "Borough", "Second Empty"]] = mexico_city1["place_with_parent_names"].str.split("|", 4, expand=True)mexico_city1 = mexico_city1.drop(["First Empty", "Second Empty"], axis=1)mexico_city1 = mexico_city1.dropna(axis=0)# Exclude some outliersmexico_city1 = mexico_city1[mexico_city1["price"] < 100000000]mexico_city1 = mexico_city1[mexico_city1["Borough"] != ""]mexico_city1.head()xxxxxxxxxxWe can calculate the number of real estate showing in the data set by Borough using `value_counts()`, then plot it as bar plot:We can calculate the number of real estate showing in the data set by Borough using value_counts(), then plot it as bar plot:
xxxxxxxxxx# Use value_counts() to get the datamexico_city1["Borough"].value_counts()xxxxxxxxxx# Plot value_counts() datafig = px.bar(mexico_city1["Borough"].value_counts())fig.show()xxxxxxxxxxWe can plot more expressive bar plots by adding more arguments. For example, we can plot the number of observations by borough and property type. First of all, we need use `groupby` to calculate the aggregated counts for each Borough and property type combination:We can plot more expressive bar plots by adding more arguments. For example, we can plot the number of observations by borough and property type. First of all, we need use groupby to calculate the aggregated counts for each Borough and property type combination:
xxxxxxxxxxsize_df = mexico_city1.groupby(["Borough", "property_type"], as_index=False).size()size_df.head()xxxxxxxxxxBy specifying `x`, `y` and `color`, the following bar graph shows the total counts by Borough, with different property types showing in different colors. Note `y` has to be numerical, while `x` and `color` are usually categorical variables.<span style='color: transparent; font-size:1%'>WQU WorldQuant University Applied Data Science Lab QQQQ</span>By specifying x, y and color, the following bar graph shows the total counts by Borough, with different property types showing in different colors. Note y has to be numerical, while x and color are usually categorical variables.WQU WorldQuant University Applied Data Science Lab QQQQ
xxxxxxxxxxfig = px.bar(size_df, x="Borough", y="size", color="property_type", barmode="relative")fig.show()xxxxxxxxxxNote the argument `barmode` is specified as 'relative', which is also the default value. In this mode, bars are stacked above each other. We can also use 'overlay' where bars are drawn on top of each other.Note the argument barmode is specified as 'relative', which is also the default value. In this mode, bars are stacked above each other. We can also use 'overlay' where bars are drawn on top of each other.
xxxxxxxxxxfig = px.bar(size_df, x="Borough", y="size", color="property_type", barmode="overlay")fig.show()xxxxxxxxxxIf we want bars to be placed beside each other, we can specify `barmode` as "group":If we want bars to be placed beside each other, we can specify barmode as "group":
xxxxxxxxxxfig = px.bar(size_df, x="Borough", y="size", color="property_type", barmode="group")fig.show()xxxxxxxxxx<font size="+1">Practice</font> Practice
Plot bar plot for the number of observations by property types in "mexico-city-real-estate-1.csv".
xxxxxxxxxxbar_df = ...fig = ...fig.show()xxxxxxxxxx# References and Further ReadingReferences and Further Reading¶
- Official plotly express Documentation on Scatter Plots
- Official plotly Express Documentation on 3D Plots
- Official plotly Documentation on Notebooks
- plotly Community Forum Post on Axis Labeling
- plotly express Official Documentation on Tile Maps
- plotly Choropleth Maps in Python Document
- plotly express Official Documentation on Figure Display
- Online Tutorial on String Conversion in Pandas
- Official Pandas Documentation on using Lambda Functions on a Column
- Official Seaborn Documentation on Generating a Heatmap
- Online Tutorial on Correlation Matrices in Pandas
- Official Pandas Documentation on Correlation Matrices
- Official Matplotlib Documentation on Colormaps
- Official Pandas Documentation on Box Plots
- Online Tutorial on Box Plots
- Online Tutorial on Axes Labels in Seaborn and Matplotlib
- Matplotlib Gallery Example of an Annotated Heatmap
xxxxxxxxxx---Copyright 2022 WorldQuant University. This content is licensed solely for personal use. Redistribution or publication of this material is strictly prohibited.
xxxxxxxxxxUsage Guidelines
This lesson is part of the DS Lab core curriculum. For that reason, this notebook can only be used on your WQU virtual machine.
This means:
- ⓧ No downloading this notebook.
- ⓧ No re-sharing of this notebook with friends or colleagues.
- ⓧ No downloading the embedded videos in this notebook.
- ⓧ No re-sharing embedded videos with friends or colleagues.
- ⓧ No adding this notebook to public or private repositories.
- ⓧ No uploading this notebook (or screenshots of it) to other websites, including websites for study resources.
xxxxxxxxxx<font size="+3"><strong>Visualizing Data: seaborn</strong></font>Visualizing Data: seaborn
xxxxxxxxxxThere are many ways to interact with data, and one of the most powerful modes of interaction is through **visualizations**. Visualizations show data graphically, and are useful for exploring, analyzing, and presenting datasets. We use four libraries for making visualizations: [pandas](../%40textbook/07-visualization-pandas.ipynb), [Matplotlib](../%40textbook/06-visualization-matplotlib.ipynb), [plotly express](../%40textbook/08-visualization-plotly.ipynb), and seaborn. In this section, we'll focus on using seaborn.There are many ways to interact with data, and one of the most powerful modes of interaction is through visualizations. Visualizations show data graphically, and are useful for exploring, analyzing, and presenting datasets. We use four libraries for making visualizations: pandas, Matplotlib, plotly express, and seaborn. In this section, we'll focus on using seaborn.
xxxxxxxxxx# Scatter PlotsScatter Plots¶
xxxxxxxxxxA **scatter plot** is a graph that uses dots to represent values for two different numeric variables. The position of each dot on the horizontal and vertical axis indicates values for an individual data point. Scatter plots are used to observe relationships between variables, and are especially useful if you're looking for **correlations**. A scatter plot is a graph that uses dots to represent values for two different numeric variables. The position of each dot on the horizontal and vertical axis indicates values for an individual data point. Scatter plots are used to observe relationships between variables, and are especially useful if you're looking for correlations.
In the following example, we will see some scatter plots based on the Mexico City real estate data. Specifically, we can use scatter plot to show how "price" and "surface_covered_in_m2" are correlated. First we need to read the data set and do a little cleaning.
import pandas as pdimport seaborn as sns# Read Datamexico_city1 = pd.read_csv("./data/mexico-city-real-estate-1.csv")# Clean the data and drop `NaNs`mexico_city1 = mexico_city1.drop( ["floor", "price_usd_per_m2", "expenses", "rooms"], axis=1)mexico_city1 = mexico_city1.dropna(axis=0)# Exclude some outliersmexico_city1 = mexico_city1[mexico_city1["price"] < 100000000]mexico_city1.head()| operation | property_type | place_with_parent_names | lat-lon | price | currency | price_aprox_local_currency | price_aprox_usd | surface_total_in_m2 | surface_covered_in_m2 | price_per_m2 | properati_url | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | sell | apartment | |México|Distrito Federal|Cuauhtémoc| | 19.41501,-99.175174 | 2700000.0 | MXN | 2748947.10 | 146154.51 | 61.0 | 61.0 | 44262.295082 | http://cuauhtemoc.properati.com.mx/2pu_venta_a... |
| 3 | sell | apartment | |México|Distrito Federal|Cuauhtémoc| | 19.41501,-99.175174 | 6347000.0 | MXN | 6462061.92 | 343571.36 | 176.0 | 128.0 | 49585.937500 | http://cuauhtemoc.properati.com.mx/2pv_venta_a... |
| 6 | sell | apartment | |México|Distrito Federal|Miguel Hidalgo| | 19.456564,-99.191724 | 670000.0 | MXN | 682146.11 | 36267.97 | 65.0 | 65.0 | 10307.692308 | http://miguel-hidalgo-df.properati.com.mx/46h_... |
| 7 | sell | apartment | |México|Distrito Federal|Gustavo A. Madero| | 19.512787,-99.141393 | 1400000.0 | MXN | 1425379.97 | 75783.82 | 82.0 | 70.0 | 20000.000000 | http://gustavo-a-madero.properati.com.mx/46p_v... |
| 8 | sell | house | |México|Distrito Federal|Álvaro Obregón| | 19.358776,-99.213557 | 6680000.0 | MXN | 6801098.67 | 361597.08 | 346.0 | 346.0 | 19306.358382 | http://alvaro-obregon.properati.com.mx/46t_ven... |
xxxxxxxxxxUse seaborn to plot the scatter plot for `"price"` and `"surface_covered_in_m2"`:Use seaborn to plot the scatter plot for "price" and "surface_covered_in_m2":
sns.scatterplot(data=mexico_city1, x="price", y="surface_covered_in_m2");xxxxxxxxxxThere is a very useful argument in `scatterplot` called `hue`. By specifying a categorical column as `hue`, seaborn can create a scatter plot between two variables in different categories with different colors. Let's check the following example using `"property_type"`:There is a very useful argument in scatterplot called hue. By specifying a categorical column as hue, seaborn can create a scatter plot between two variables in different categories with different colors. Let's check the following example using "property_type":
sns.scatterplot( data=mexico_city1, x="price", y="surface_covered_in_m2", hue="property_type");xxxxxxxxxx<font size="+1">Practice</font>Practice
Plot a scatter plot for "price" and "surface_total_in_m2" by "property_type" for "mexico-city-real-estate-1.csv":
xxxxxxxxxx# Bar ChartsBar Charts¶
xxxxxxxxxxA **bar chart** is a graph that shows all the values of a categorical variable in a dataset. They consist of an axis and a series of labeled horizontal or vertical bars. The bars depict frequencies of different values of a variable or simply the different values themselves. The numbers on the y-axis of a vertical bar chart or the x-axis of a horizontal bar chart are called the scale. A bar chart is a graph that shows all the values of a categorical variable in a dataset. They consist of an axis and a series of labeled horizontal or vertical bars. The bars depict frequencies of different values of a variable or simply the different values themselves. The numbers on the y-axis of a vertical bar chart or the x-axis of a horizontal bar chart are called the scale.
In the following example, we will see some bar plots based on the Mexico City real estate dataset. Specifically, we will count the number of observations in each borough and plot them. We first need to import the dataset and extract the borough and other location information from column "place_with_parent_names".
xxxxxxxxxx# Read Datamexico_city1 = pd.read_csv("./data/mexico-city-real-estate-1.csv")# Clean the data and drop `NaNs`mexico_city1 = mexico_city1.drop( ["floor", "price_usd_per_m2", "expenses", "rooms"], axis=1)# find location columns from place_with_parent_namesmexico_city1[ ["First Empty", "Country", "City", "Borough", "Second Empty"]] = mexico_city1["place_with_parent_names"].str.split("|", 4, expand=True)mexico_city1 = mexico_city1.drop(["First Empty", "Second Empty"], axis=1)mexico_city1 = mexico_city1.dropna(axis=0)# Exclude some outliersmexico_city1 = mexico_city1[mexico_city1["price"] < 100000000]mexico_city1 = mexico_city1[mexico_city1["Borough"] != ""]mexico_city1.head()/tmp/ipykernel_721/836102575.py:12: FutureWarning: In a future version of pandas all arguments of StringMethods.split except for the argument 'pat' will be keyword-only.
] = mexico_city1["place_with_parent_names"].str.split("|", 4, expand=True)
| operation | property_type | place_with_parent_names | lat-lon | price | currency | price_aprox_local_currency | price_aprox_usd | surface_total_in_m2 | surface_covered_in_m2 | price_per_m2 | properati_url | Country | City | Borough | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | sell | apartment | |México|Distrito Federal|Cuauhtémoc| | 19.41501,-99.175174 | 2700000.0 | MXN | 2748947.10 | 146154.51 | 61.0 | 61.0 | 44262.295082 | http://cuauhtemoc.properati.com.mx/2pu_venta_a... | México | Distrito Federal | Cuauhtémoc |
| 3 | sell | apartment | |México|Distrito Federal|Cuauhtémoc| | 19.41501,-99.175174 | 6347000.0 | MXN | 6462061.92 | 343571.36 | 176.0 | 128.0 | 49585.937500 | http://cuauhtemoc.properati.com.mx/2pv_venta_a... | México | Distrito Federal | Cuauhtémoc |
| 6 | sell | apartment | |México|Distrito Federal|Miguel Hidalgo| | 19.456564,-99.191724 | 670000.0 | MXN | 682146.11 | 36267.97 | 65.0 | 65.0 | 10307.692308 | http://miguel-hidalgo-df.properati.com.mx/46h_... | México | Distrito Federal | Miguel Hidalgo |
| 7 | sell | apartment | |México|Distrito Federal|Gustavo A. Madero| | 19.512787,-99.141393 | 1400000.0 | MXN | 1425379.97 | 75783.82 | 82.0 | 70.0 | 20000.000000 | http://gustavo-a-madero.properati.com.mx/46p_v... | México | Distrito Federal | Gustavo A. Madero |
| 8 | sell | house | |México|Distrito Federal|Álvaro Obregón| | 19.358776,-99.213557 | 6680000.0 | MXN | 6801098.67 | 361597.08 | 346.0 | 346.0 | 19306.358382 | http://alvaro-obregon.properati.com.mx/46t_ven... | México | Distrito Federal | Álvaro Obregón |
xxxxxxxxxxLet's check the example of a bar plot showing the value counts of each borough in the dataset. We first need to create a DataFrame showing the value counts:Let's check the example of a bar plot showing the value counts of each borough in the dataset. We first need to create a DataFrame showing the value counts:
bar_df = pd.DataFrame(mexico_city1["Borough"].value_counts()).reset_index()bar_df| index | Borough | |
|---|---|---|
| 0 | Miguel Hidalgo | 345 |
| 1 | Cuajimalpa de Morelos | 255 |
| 2 | Álvaro Obregón | 203 |
| 3 | Benito Juárez | 198 |
| 4 | Tlalpan | 171 |
| 5 | Iztapalapa | 134 |
| 6 | Tláhuac | 125 |
| 7 | Cuauhtémoc | 120 |
| 8 | Gustavo A. Madero | 89 |
| 9 | Venustiano Carranza | 81 |
| 10 | Coyoacán | 80 |
| 11 | La Magdalena Contreras | 41 |
| 12 | Xochimilco | 34 |
| 13 | Iztacalco | 27 |
| 14 | Azcapotzalco | 24 |
| 15 | Milpa Alta | 1 |
xxxxxxxxxxSince there are 16 different categories in Borough, we should increase the default plot size and rotate the x axis to make the plot more readable using the following syntax:Since there are 16 different categories in Borough, we should increase the default plot size and rotate the x axis to make the plot more readable using the following syntax:
# Increase plot sizesns.set(rc={"figure.figsize": (15, 4)})# Plot the bar plotax = sns.barplot(data=bar_df, x="index", y="Borough")# Rotate the x axisax.set_xticklabels(ax.get_xticklabels(), rotation=75)[Text(0, 0, 'Miguel Hidalgo'), Text(1, 0, 'Cuajimalpa de Morelos'), Text(2, 0, 'Álvaro Obregón'), Text(3, 0, 'Benito Juárez'), Text(4, 0, 'Tlalpan'), Text(5, 0, 'Iztapalapa'), Text(6, 0, 'Tláhuac'), Text(7, 0, 'Cuauhtémoc'), Text(8, 0, 'Gustavo A. Madero'), Text(9, 0, 'Venustiano Carranza'), Text(10, 0, 'Coyoacán'), Text(11, 0, 'La Magdalena Contreras'), Text(12, 0, 'Xochimilco'), Text(13, 0, 'Iztacalco'), Text(14, 0, 'Azcapotzalco'), Text(15, 0, 'Milpa Alta')]
xxxxxxxxxx<font size="+1">Practice</font>Practice
Plot a bar plot showing the value counts for property types in "mexico-city-real-estate-1.csv":
xxxxxxxxxxpro_typ_df = pd.DataFrame(mexico_city1["property_type"].value_counts()).reset_index()pro_typ_dfsns.barplot(data =pro_typ_df,x="index",y="property_type")<AxesSubplot:xlabel='index', ylabel='property_type'>
xxxxxxxxxx# Correlation HeatmapsCorrelation Heatmaps¶
A correlation heatmap shows the relative strength of correlations between the variables in a dataset. Here's what the code looks like:
xxxxxxxxxximport pandas as pdimport seaborn as snsmexico_city1 = pd.read_csv("./data/mexico-city-real-estate-1.csv")mexico_city1 = mexico_city1.drop( ["floor", "price_usd_per_m2", "expenses", "rooms"], axis=1)mexico_city1 = mexico_city1.dropna(axis=0)mexico_city1_numeric = mexico_city1.select_dtypes(include="number")corr = mexico_city1_numeric.corr(method="kendall")sns.heatmap(corr)<AxesSubplot:>
xxxxxxxxxxNotice that we dropped the columns and rows with missing entries before plotting the graph.Notice that we dropped the columns and rows with missing entries before plotting the graph.
This heatmap is showing us what we might already have suspected: the price is moderately positively correlated with the size of the properties.
xxxxxxxxxx<font size="+1">Practice</font>Practice
The seaborn documentation on heat maps indicates how to add numeric labels to each cell and how to use a different colormap. Modify the plot to use the viridis colormap, have a linewidth of 0.5 between each cell and have numeric labels for each cell.
xxxxxxxxxxxxxxxxxxxx# References and Further ReadingReferences and Further Reading¶
- Official Plotly Express Documentation on Scatter Plots
- Official Plotly Express Documentation on 3D Plots
- Official Plotly Documentation on Notebooks
- Plotly Community Forum Post on Axis Labeling
- Plotly Express Official Documentation on Tile Maps
- Plotly Express Official Documentation on Figure Display
- Online Tutorial on String Conversion in Pandas
- Official Pandas Documentation on using Lambda Functions on a Column
- Official seaborn Documentation on Generating a Heatmap
- Online Tutorial on Correlation Matrices in Pandas
- Official Pandas Documentation on Correlation Matrices
- Official Matplotlib Documentation on Colormaps
- Official Pandas Documentation on Box Plots
- Online Tutorial on Box Plots
- Online Tutorial on Axes Labels in seaborn and Matplotlib
- Matplotlib Gallery Example of an Annotated Heatmap
xxxxxxxxxx---Copyright 2022 WorldQuant University. This content is licensed solely for personal use. Redistribution or publication of this material is strictly prohibited. WQU WorldQuant University Applied Data Science Lab QQQQ
xxxxxxxxxx---Copyright 2022 WorldQuant University. This content is licensed solely for personal use. Redistribution or publication of this material is strictly prohibited.
xxxxxxxxxxUsage Guidelines
This lesson is part of the DS Lab core curriculum. For that reason, this notebook can only be used on your WQU virtual machine.
This means:
- ⓧ No downloading this notebook.
- ⓧ No re-sharing of this notebook with friends or colleagues.
- ⓧ No downloading the embedded videos in this notebook.
- ⓧ No re-sharing embedded videos with friends or colleagues.
- ⓧ No adding this notebook to public or private repositories.
- ⓧ No uploading this notebook (or screenshots of it) to other websites, including websites for study resources.
xxxxxxxxxx<font size="+3"><strong>Databases: SQL</strong></font>Databases: SQL
from IPython.display import YouTubeVideoxxxxxxxxxx# Working with SQL DatabasesWorking with SQL Databases¶
xxxxxxxxxxA database is a collection of interrelated data. The primary goal of a database is to store and retrieve information in a convenient and efficient way. There are many types of databases. In this section, we will be dealing with a **relational database**. A relational database is a widely used database model that consists of a collection of uniquely named **tables** used to store information. The structure of a database model with its tables, constraints, and relationships is called a **schema**. A database is a collection of interrelated data. The primary goal of a database is to store and retrieve information in a convenient and efficient way. There are many types of databases. In this section, we will be dealing with a relational database. A relational database is a widely used database model that consists of a collection of uniquely named tables used to store information. The structure of a database model with its tables, constraints, and relationships is called a schema.
A Structured Query Language (SQL), is used to retrieve information from a relational database. SQL is one of the most commonly used database languages. It allows data stored in a relational database to be queried, modified, and manipulated easily with basic commands. SQL powers database engines like MySQL, SQL Server, SQLite, and PostgreSQL. The examples and projects in this course will use SQLite.
A table refers to a collection of rows and columns in a relational database. When reading data into a pandas DataFrame, an index can be defined, which acts as the label for every row in the DataFrame.
xxxxxxxxxx# Connecting to a DatabaseConnecting to a Database¶
xxxxxxxxxx## ipython-sql ipython-sql¶
xxxxxxxxxx### Magic CommandsMagic Commands¶
xxxxxxxxxxJupyter notebooks can run code that is not valid Python code but still affect the notebook . These special commands are called magic commands. Magic commands can have a range of properties. Some commonly used magic functions are below:Jupyter notebooks can run code that is not valid Python code but still affect the notebook . These special commands are called magic commands. Magic commands can have a range of properties. Some commonly used magic functions are below:
| Magic Command | Description of Command |
|---|---|
%pwd |
Print the current working directory |
%cd |
Change the current working directory |
%ls |
List the contents of the current directory |
%history |
Show the history of the In [ ]: commands |
We will be leveraging magic commands to work with a SQLite database.
xxxxxxxxxx### ipython-sqlipython-sql¶
xxxxxxxxxx`ipython-sql` allows you to write SQL code directly in a Jupyter Notebook. The `%sql` (or `%%sql`) magic command is added to the beginning of a code block and then SQL code can be written.ipython-sql allows you to write SQL code directly in a Jupyter Notebook. The %sql (or %%sql) magic command is added to the beginning of a code block and then SQL code can be written.
xxxxxxxxxx### Connecting with ipython-sqlConnecting with ipython-sql¶
xxxxxxxxxxWe can connect to a database using the %sql magic function:We can connect to a database using the %sql magic function:
%load_ext sql%sql sqlite:////home/jovyan/nepal.sqlitexxxxxxxxxx## sqlite3sqlite3¶
xxxxxxxxxxWe can also connect to the same database using the sqlite3 package:We can also connect to the same database using the sqlite3 package:
import sqlite3conn = sqlite3.connect("/home/jovyan/nepal.sqlite")xxxxxxxxxx# Querying a DatabaseQuerying a Database¶
xxxxxxxxxx## Building Blocks of the Basic QueryBuilding Blocks of the Basic Query¶
xxxxxxxxxxThere are six common clauses used for querying data:There are six common clauses used for querying data:
| Clause Name | Definition |
|---|---|
SELECT |
Determines which columns to include in the query's result |
FROM |
Identifies the table from which to query the data from |
WHERE |
filters data |
GROUP BY |
groups rows by common values in columns |
HAVING |
filters out unwanted groups from GROUP BY |
ORDER BY |
Orders the rows using one or more columns |
LIMIT |
Outputs the specified number of rows |
All clauses may be used together, but SELECT and FROM are the only required clauses. The format of clauses is in the example query below:
SELECT column1, column2
FROM table_name
WHERE "conditions"
GROUP BY "column-list"
HAVING "conditions"
ORDER BY "column-list"
xxxxxxxxxx## SELECT and FROMSELECT and FROM¶
xxxxxxxxxxYou can use `SELECT *` to select all columns in a table. `FROM` specifies the table in the database to query. `LIMIT 5` will select only the first five rows. You can use SELECT * to select all columns in a table. FROM specifies the table in the database to query. LIMIT 5 will select only the first five rows.
Example
xxxxxxxxxx%%sqlSELECT *FROM id_mapLIMIT 5xxxxxxxxxxYou can also use `SELECT` to select certain columns in a tableYou can also use SELECT to select certain columns in a table
xxxxxxxxxx%%sqlSELECT household_id, building_idFROM id_mapLIMIT 5xxxxxxxxxx<font size="+1">Practice</font>Practice
Try it yourself! Use SELECT to select the district_id column from the id_map table.
xxxxxxxxxx%%sqlxxxxxxxxxxWe can also assign an **alias** or temporary name to a column using the `AS` command. Aliases can also be used on a table. See the example below, which assigns the alias `household_number` to `household_id`We can also assign an alias or temporary name to a column using the AS command. Aliases can also be used on a table. See the example below, which assigns the alias household_number to household_id
xxxxxxxxxx%%sqlSELECT household_id AS household_number, building_idFROM id_mapLIMIT 5xxxxxxxxxx<font size="+1">Practice</font>Practice
Try it yourself! Use SELECT, FROM, AS, and LIMIT to select the first 5 rows from the id_map table. Rename the district_id column to district_number.
xxxxxxxxxx%%sqlxxxxxxxxxx## Filtering and Sorting DataFiltering and Sorting Data¶
xxxxxxxxxxSQL provides a variety of comparison operators that can be used with the WHERE clause to filter the data. SQL provides a variety of comparison operators that can be used with the WHERE clause to filter the data.
| Comparison Operator | Description |
|---|---|
| = | Equal |
| > | Greater than |
| < | Less than |
| >= | Greater than or equal to |
| <= | Less than or equal to |
| <> or != | Not equal to |
| LIKE | String comparison test |
xxxxxxxxxxFor example, to select the first 5 homes in Ramechhap (district `2`):For example, to select the first 5 homes in Ramechhap (district 2):
xxxxxxxxxx%%sqlxxxxxxxxxx<font size="+1">Practice</font>Practice
Try it yourself! Use WHERE to select the row with household_id equal to 13735001
xxxxxxxxxx%%sqlxxxxxxxxxx## Aggregating DataAggregating Data¶
xxxxxxxxxxAggregation functions take a collection of values as inputs and return one value as the output. The table below gives the frequently used built-in aggregation functions:Aggregation functions take a collection of values as inputs and return one value as the output. The table below gives the frequently used built-in aggregation functions:
| Aggregation Function | Definition |
|---|---|
MIN |
Return the minimum value |
MAX |
Return the largest value |
SUM |
Return the sum of values |
AVG |
Return the average of values |
COUNT |
Return the number of observations |
xxxxxxxxxxUse the `COUNT` function to find the number of observations in the `id_map` table that come from Ramechhap (district `2`):Use the COUNT function to find the number of observations in the id_map table that come from Ramechhap (district 2):
xxxxxxxxxx%%sqlSELECT count(*)FROM id_mapWHERE district_id = 2xxxxxxxxxxAggregation functions are frequently used with a `GROUP BY` clause to perform the aggregation on groups of data. For example, the query below returns the count of observations in each District:Aggregation functions are frequently used with a GROUP BY clause to perform the aggregation on groups of data. For example, the query below returns the count of observations in each District:
xxxxxxxxxx%%sqlSELECT district_id, count(*)FROM id_mapGROUP BY district_idxxxxxxxxxx `DISTINCT` is a keyword to select unique records in a query result. For example, if we want to know the unique values in the `district_id` column: DISTINCT is a keyword to select unique records in a query result. For example, if we want to know the unique values in the district_id column:
xxxxxxxxxx%%sqlSELECT distinct(district_id)FROM id_mapxxxxxxxxxx<font size="+1">Practice</font>Practice
Try it yourself! Use DISTINCT to count the number of unique values in the vdcmun_id column.
xxxxxxxxxx%%sqlxxxxxxxxxx`DISTINCT` and `COUNT` can be used in combination to count the number of distinct records. For example, if we want to know the number of unique values in the `district_id` column:DISTINCT and COUNT can be used in combination to count the number of distinct records. For example, if we want to know the number of unique values in the district_id column:
xxxxxxxxxx%%sqlSELECT count(distinct(district_id))FROM id_mapxxxxxxxxxx<font size="+1">Practice</font>Practice
Try it yourself! Use DISTINCT and COUNT to count the number of unique values in the vdcmun_id column.
xxxxxxxxxx%%sqlxxxxxxxxxx# Joining TablesJoining Tables¶
xxxxxxxxxxJoins link data from two or more tables together by using a column that is common between the two tables. The basic syntax for a join is below, where `table1` and `table2` refer to the two tables being joined, `column1` and `column2` refer to columns to be returned from both tables, and `ID` refers to the common column in the two tables. Joins link data from two or more tables together by using a column that is common between the two tables. The basic syntax for a join is below, where table1 and table2 refer to the two tables being joined, column1 and column2 refer to columns to be returned from both tables, and ID refers to the common column in the two tables.
SELECT table1.column1,
table2.column2
FROM table_1
JOIN table2 ON table1.id = table1.id
xxxxxxxxxxWe'll explore the concept of joins by first identifying a single household that we'd like to pull in building information for. For example, let's say we want to see the corresponding `foundation_type` for the first home in Ramechhap (District 1). We'll start by looking at this single record in the `id_map` table.We'll explore the concept of joins by first identifying a single household that we'd like to pull in building information for. For example, let's say we want to see the corresponding foundation_type for the first home in Ramechhap (District 1). We'll start by looking at this single record in the id_map table.
xxxxxxxxxx%%sqlSELECT *FROM id_mapWHERE district_id = 2LIMIT 1xxxxxxxxxxThis household has `building_id` equal to 23. Let's look at the `foundation_type` for this building, by filtering the `building_structure` table to find this building.This household has building_id equal to 23. Let's look at the foundation_type for this building, by filtering the building_structure table to find this building.
xxxxxxxxxx%%sqlSELECT building_id, foundation_typeFROM building_structureWHERE building_id = 23xxxxxxxxxxTo join the two tables and limit the results to `building_id = 23`: To join the two tables and limit the results to building_id = 23:
xxxxxxxxxx%%sqlSELECT id_map.*, building_structure.foundation_typeFROM id_mapJOIN building_structure ON id_map.building_id = building_structure.building_idWHERE id_map.building_id = 23xxxxxxxxxxIn addition to the basic `JOIN` clause, specific join types can be specified, which specify whether the common column needs to be in one, both, or either of the two tables being joined. The different join types are below. The left table is the table specified first, immediately after the `FROM` clause and the right table is the table specified after the `JOIN` clause. If the generic `JOIN` clause is used, then by default the `INNER JOIN` will be used.In addition to the basic JOIN clause, specific join types can be specified, which specify whether the common column needs to be in one, both, or either of the two tables being joined. The different join types are below. The left table is the table specified first, immediately after the FROM clause and the right table is the table specified after the JOIN clause. If the generic JOIN clause is used, then by default the INNER JOIN will be used.
| JOIN Type | Definition |
|---|---|
INNER JOIN |
Returns rows where ID is in both tables |
LEFT JOIN |
Returns rows where ID is in the left table. Return NA for values in column, if ID is not in right table. |
RIGHT JOIN |
Returns rows where ID is in the right table. Return NA for values in column, if ID is not in left table. |
FULL JOIN |
Returns rows where ID is in either table. Return NA for values in column, if ID is not in either table. |
| WQU WorldQuant University Applied Data Science Lab QQQQ |
xxxxxxxxxxThe video below outlines the main types of joins:The video below outlines the main types of joins:
xxxxxxxxxxYouTubeVideo("2HVMiPPuPIM")xxxxxxxxxx<font size="+1">Practice</font>Practice
Try it yourself! Use the DISTINCT command to create a column with all unique building IDs in the id_map table. LEFT JOIN this column with the roof_type column from the building_structure table, showing only buildings where district_id is 1 and limiting your results to the first five rows of the new table.
xxxxxxxxxx%%sqlxxxxxxxxxx# Using pandas with SQL DatabasesUsing pandas with SQL Databases¶
xxxxxxxxxxTo save the output of a query into a pandas DataFrame, we will use connect to the SQLite database using the SQLite3 package:To save the output of a query into a pandas DataFrame, we will use connect to the SQLite database using the SQLite3 package:
xxxxxxxxxximport sqlite3conn = sqlite3.connect("/home/jovyan/nepal.sqlite")xxxxxxxxxxTo run a query using `sqlite3`, we need to store the query as a string. For example, the variable below called `query` is a string containing a query which returns the first 10 rows from the `id_map` table:To run a query using sqlite3, we need to store the query as a string. For example, the variable below called query is a string containing a query which returns the first 10 rows from the id_map table:
xxxxxxxxxxquery = """ SELECT * FROM id_map LIMIT 10 """xxxxxxxxxxTo save the results of the query into a pandas DataFrame, use the `pd.read_sql()` function. The optional parameter `index_col` can be used to set the index to a specific column from the query. To save the results of the query into a pandas DataFrame, use the pd.read_sql() function. The optional parameter index_col can be used to set the index to a specific column from the query.
xxxxxxxxxximport pandas as pddf = pd.read_sql(query, conn, index_col="building_id")df.head()xxxxxxxxxx<font size="+1">Practice</font>Practice
Try it yourself! Use the pd.read_sql function to save the results of a query to a DataFrame. The query should select first 20 rows from the id_map table.
xxxxxxxxxxquery = ...df2 = ...df2.head()xxxxxxxxxx# References & Further Readingxxxxxxxxxx---Copyright 2022 WorldQuant University. This content is licensed solely for personal use. Redistribution or publication of this material is strictly prohibited.
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx-
Variables
Callstack
Breakpoints
Source
xxxxxxxxxx- Working with SQL Databases
- Connecting to a Database
- ipython-sql
- Magic Commands
- ipython-sql
- Connecting with ipython-sql
- sqlite3
- Querying a Database
- Building Blocks of the Basic Query
- SELECT and FROM
- Filtering and Sorting Data
- Aggregating Data
- Joining Tables
- Using pandas with SQL Databases
- References & Further Reading